A Corpus-Based Approach to Topic in Danish Dialog

Authors

  • Philip Diderichsen
  • Jakob Elming
Abstract

We report on an investigation of the pragmatic category of topic in Danish dialog and its correlation with surface features of NPs. Using a corpus of 444 utterances, we trained a decision tree system on 16 features. The system achieved near-human performance with success rates of 84–89% and F1-scores of 0.63–0.72 in 10-fold cross-validation tests (human performance: 89% and 0.78). The most important features turned out to be preverbal position, definiteness, pronominalisation, and non-subordination. We discovered that NPs in epistemic matrix clauses (e.g. “I think ...”) were seldom topics, and we suspect that this holds for other interpersonal matrix clauses as well.
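The gap between the success rates and the F1-scores reported above is typical when the positive class (here, topic NPs) is rare. The following is a minimal Python sketch of the two measures and of a 10-fold split. It is not the authors' system: the function names `accuracy_and_f1` and `k_fold_indices` are hypothetical, the toy labels are invented, and 444 is used only because it is the utterance count given in the abstract (whether the folds were formed over utterances or over NPs is not stated here).

```python
from typing import List, Tuple


def accuracy_and_f1(gold: List[bool], pred: List[bool]) -> Tuple[float, float]:
    """Return (success rate, F1 for the positive class, here 'topic')."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)        # topics found
    fp = sum(1 for g, p in pairs if not g and p)    # non-topics labeled topic
    fn = sum(1 for g, p in pairs if g and not p)    # topics missed
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k (train, test) index pairs; every item is tested exactly once."""
    splits = []
    for i in range(k):
        test = list(range(i, n, k))                 # every k-th item, offset by i
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        splits.append((train, test))
    return splits


# Toy data (invented for illustration): 10 NPs, of which only 2 are topics.
gold = [True, True] + [False] * 8
pred = [True, False, True] + [False] * 7
accuracy, f1 = accuracy_and_f1(gold, pred)
print(f"success rate = {accuracy:.2f}, F1 = {f1:.2f}")

# Assumption: 444 items, matching the utterance count in the abstract.
splits = k_fold_indices(444, k=10)
print([len(test) for _, test in splits])
```

On the toy data the success rate is 0.80 while F1 is only 0.50, because the many correctly rejected non-topics raise the success rate but do not enter the F1 computation.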

Related resources

A rule based approach to extraction of topics and dialog acts in a spoken dialog system

This paper presents a rule-based approach to extraction of dialog acts and topics from utterances in a spoken dialog system, SDSKIT-3, with a task-independent dialog controller which is based on an extension of the frame-driven method. We demonstrated that it could control dialogs in several different task domains, given only a set of topic frames and a set of rules manually designed for the discours...

Topic Spotting Using Subword Units

Belongs to proposal section 4.8. The research project underlying this report was funded by the German Federal Minister for Education, Science, Research and Technology under grant number 01 IV 701 K/5. Responsibility for the content of this work lies with the authors. Abstract: In this paper we present a new approach for topic spotting based on subword units...

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the keywords' similarity with the golden standard and users' views of the model keywords. Methodology: This is a mixed text-mining study in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Extraction of the dialog act and the topic from utterances in a spoken dialog system

This paper presents an approach to extraction of dialog acts and topics from utterances in a spoken dialog system. Two knowledge sources are used to describe the dialog history. One is a transition network of dialog acts and the other is a tree of topics which might appear in domain communications. Dialog acts and topics are extracted through bottom-up and top-down analyses. Bottom-up candidate...

Annotations and Tools for an Activity Based Spoken Language Corpus

The paper contains a description of the Spoken Language Corpus of Swedish at the Department of Linguistics, Göteborg University (GSLC), and a summary of the various types of analysis and tools that have been developed for work on this corpus. Work on the corpus started in the late 1970s. It is growing incrementally and presently consists of 1.3 million words from about 25 different social ...

Publication date: 2005